Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSHARP-5017: Retry KMS requests on transient errors #1541

Open
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

papafe
Copy link
Contributor

@papafe papafe commented Nov 14, 2024

No description provided.

}
}
catch (Exception ex) when (ex is IOException or SocketException)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure here which other kind of network-errors related exceptions we could get

Copy link
Contributor

@adelinowona adelinowona Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw some comment on the Drivers ticket for this feature that it only pertains to HTTP errors so for socket-level errors do we also retry on those? Other comments in the ticket seems to suggest we'll need to tell libmongocrypt to reset its state if we encounter socket errors while reading a kms response. Actually are http errors thrown as socket or IOException errors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that comment is a little bit misleading. So, as far as I have understood the situation is like this:

  • The HTTP error are actually handled internally by libmongocrypt when we "feed" the response bytes to it. It reads the response and understands if it's an HTTP response with an error code.
  • The network errors instead need to be recognized by us. When we do, we need to call mongocrypt_kms_ctx_fail (that is used inside the new Fail method) to notify libmongocrypt of the network error.
  • Internally, the handling for both errors is the same, as it retries the request a certain number of times

@papafe
Copy link
Contributor Author

papafe commented Nov 14, 2024

This PR is still in a rough-ish state, and it's here mostly to get feedback on the approach used.
What is missing:

  • fix sync version of the methods
  • add sync version of the test
  • make the tests work on evergreen
  • some small cleanup

@@ -210,6 +210,13 @@ static Library()
() => __loader.Value.GetFunction<Delegates.mongocrypt_ctx_destroy>(("mongocrypt_ctx_destroy")), true);
_mongocrypt_kms_ctx_get_kms_provider = new Lazy<Delegates.mongocrypt_kms_ctx_get_kms_provider>(
() => __loader.Value.GetFunction<Delegates.mongocrypt_kms_ctx_get_kms_provider>(("mongocrypt_kms_ctx_get_kms_provider")), true);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these and the following methods/variables, is there a specific ordering? It does not seem to be completely alphabetical so for now I've put the new things on the bottom

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't a specific order for the methods, but many of them are grouped together. For instance, you'll notice that most of the setopt methods are clustered. For now, I recommend placing the mongocrypt_setopt_retry_kms method alongside the other setopt methods and leaving the other new methods at the bottom. You can add a TODO comment above the static constructor to organize the methods later (so you don't have unrelated changes to your PR). I'll handle that when addressing technical debt in Libmongocrypt later this quarter.

@@ -170,6 +170,13 @@ public KmsRequestCollection GetKmsMessageRequests()
return new KmsRequestCollection(requests, this);
}

//TODO I think we should remove the previous method and use this
public KmsRequest GetNextKmsMessageRequest()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done :)


if (failureType == "network")
{
await SetFailure("network", 4);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: could probably just use failureType since we know it is network after the if statement

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it but I would prefer to keep it this way to make it easier to read

{
{ "region", "foo" },
{ "key", "bar" },
{ "endpoint", "127.0.0.1:9003" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: replace "127.0.0.1:9003" here and other uses below with endpoint

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely

@papafe papafe marked this pull request as ready for review November 19, 2024 12:47
@papafe papafe requested a review from a team as a code owner November 19, 2024 12:47
@papafe
Copy link
Contributor Author

papafe commented Nov 19, 2024

@adelinowona I've fixed also the sync part of the test and done some cleanup.
For me this is mergeable after a new version of libmongocrypt is released, because we need this small fix: mongodb/libmongocrypt#908

@rstam rstam removed the request for review from a team November 25, 2024 21:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants